MEDB 5501, Module02

2024-08-27

Topics to be covered

  • What you will learn
    • Counts and percentages
    • Computing counts and percentages using R
    • Mean and median
    • Percentiles
    • Standard deviation
    • Computing means and standard deviations in R

Count the occurrences of the letter “e”.

A quality control program is easiest
to implement from the top down. 
Make sure that you understand the 
the commitment of time and money
that is involved. Every workplace is
different, but think about allocating
10% of your time and 10% of the 
time of all your employees to 
quality control.

A practical counting example

Image of a haemocytometer

Measurement error

  • Imprecision in a physical measurement
    • Example: GPS location
      • Can be off by up to 8 meters
      • Worse around large buildings
    • Other examples
      • Weight
      • Body temperature
      • Blood glucose

Reducing measurement error

  • Calibration
  • Consistent environment
  • Good equipment
  • Quality control
  • Training

Errors of validity

  • Mostly used for constructs
  • Types of validity
    • Criterion
      • Concurrent
      • Predictive
    • Content/face
    • Many others
  • Re-establishing validity

Errors of reliability

  • Synonym: repeatability(?)
  • Not reproducibility
  • Both physical measurements and constructs
  • Types of reliability
    • Test-retest
    • Inter-rater
    • Inter-method

Errors due to sampling

  • To be covered later
  • Easiest to quantify
  • Less important in era of big data

Break #1

  • What you have learned
    • Counts and percentages
  • What’s coming next
    • Computing counts and percentages using R

Data dictionary for sharing, 1

---
data_dictionary: sharing.xlsx
source: |
  Saginova, Olga (2020), “Dataset on the 
  questionnaire-based survey of sharing 
  services users’ motivation”, Mendeley Data,
  V1, doi: 10.17632/c5k8wjrhd9.1 

Data dictionary for sharing, 2

description: | 
  From the original source: "The data set 
  presents data collected by online survey 
  with a questionnaire using Likert scale. 
  The survey sample included 184 adults (18+),
  active and potential users of different 
  sharing services platforms."
  

Data dictionary for sharing, 3

copyright: |  
  CC BY 4.0. You can share, copy and modify
  this dataset so long as you give appropriate
  credit, provide a link to the CC BY license,
  and indicate if changes were made, but you 
  may not do so in a way that suggests the 
  rights holder has endorsed you or your use
  of the dataset. Note that further permission
  may be required for any content within the
  dataset that is identified as belonging to a
  third party.
  

Data dictionary for sharing, 4

format: 
  proprietary (Excel)
varnames:
  first row of data
  
missing_value_code: 
  not applicable
  
size:
  rows: 184
  columns: 31

Data dictionary for sharing, 5

  age:  
    label: How old are you?
    values:
      - "18-25"
      - "26-35"
      - "36-45"
      - "46-60"
      - over 60

Data dictionary for sharing, 6

  gender:  
    values: 
      - F
      - M
    

Data dictionary for sharing, 7

  employment_status:  
    label: Are you employed?
    values:
      - employed
      - entrepreneur
      - full-time student
      - self-employed
      - temporarily unemployed
      - unemployed

simon-5501-02-sharing.qmd, 1

---
title: "Counts and percentages"
format: 
  html:
    slide-number: true
    embed-resources: true
editor: source
execute:
  echo: true
  message: false
  warning: false
---

simon-5501-02-sharing.qmd, 2

## Data source

This program uses data from a study of sharing services (like sharing an automobile) and produces counts and percentages for a few demographic variables. There is a [data dictionary][dd] that provides more details about the data. 

[dd]: https://github.com/pmean/datasets/blob/master/sharing.yaml

simon-5501-02-sharing.qmd, 3

## Libraries

Here are the libraries you need for this program.

```{r setup}
library(readxl)
library(tidyverse)
```

simon-5501-02-sharing.qmd, 4

## Reading the data

Here is the code to read the data and show a glimpse. There are 31 columns total, but I am showing just a few of the columns here.

```{r read}
fn <- "../data/sharing.xlsx"
sharing <- read_excel(fn)
glimpse(sharing[ , c(1, 5:7)])
```

simon-5501-02-sharing.qmd, 5

## Calculate counts and percentages for age group

```{r count-age-groups}
sharing |>
  group_by(age) |>
  summarize(n=n()) |>
  mutate(total=sum(n)) |>
  mutate(pct=100*n/total)
```

The survey respondents were younger than the general population. About half of the survey respondents were 18 to 25 years old. Only 3% were over 60. Six ages were missing.
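
The same counts and percentages can be sketched in base R with `table()`. The age values below are a made-up toy vector, not the survey data, and the object names are only for illustration. Missing values get their own category here, much as `group_by()` gives them their own row.

```r
# Toy data (hypothetical, NOT the sharing survey)
age <- c("18-25", "18-25", "26-35", "18-25", "36-45", NA, "26-35", "18-25")

# Count each category; useNA = "ifany" keeps missing values as a category
counts <- table(age, useNA = "ifany")

# Percentage = 100 * count / total
pct <- 100 * counts / sum(counts)

print(counts)
print(round(pct, 1))
```

In this toy example, four of the eight values are "18-25", so that group makes up 50% of the total.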

Break #2

  • What you have learned
    • Computing counts and percentages using R
  • What’s coming next
    • Mean and median

Calculation of the mean and median

  • Mean
    • Add up all the values, divide by the sample size
  • Median
    • Sort the data
      • Select the middle value if n is odd
      • Go halfway between the two middle values if n is even

Formal mathematical definitions

  • Mean
    • \(\bar{X}=\frac{1}{n}\Sigma X_i\)
  • Median
    • Sorted values \(X_{[1]},X_{[2]},...,X_{[n]}\)
      • \(X_{[(n+1)/2]}\) if n is odd,
      • \((X_{[n/2]}+X_{[n/2+1]})/2\) if n is even

Bacteria before and after A/C upgrade

  room before after
1  121   11.8  10.1
2  163    8.2   7.2
3  125    7.1   3.8
4  264   14.0  12.0
5  233   10.8   8.3
6  218   10.1  10.5
7  324   14.6  12.1
8  325   14.0  13.7

Calculation of the before mean

\(\frac{1}{8}(11.8+8.2+7.1+14+10.8+10.1+14.6+14)\)

\(= \frac{1}{8}(90.6)\)

\(= 11.325\)

The average colony count per cubic foot before remediation, 11.3, is quite large.

Calculation of the after mean

\(\frac{1}{8}(10.1+7.2+3.8+12+8.3+10.5+12.1+13.7)\)

\(= \frac{1}{8}(77.7)\)

\(= 9.7125\)

The average colony count per cubic foot after remediation, 9.7, is smaller, but still quite large.
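
The two hand calculations above can be checked with a few lines of R. This is a minimal sketch with the eight before and after values typed in directly; the object names are my own choice.

```r
# Colony counts per cubic foot for the eight rooms
before <- c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0)
after  <- c(10.1, 7.2, 3.8, 12.0, 8.3, 10.5, 12.1, 13.7)

n <- length(before)

# Mean = add up all the values, divide by the sample size
before_mean <- sum(before) / n   # 90.6 / 8 = 11.325
after_mean  <- sum(after) / n    # 77.7 / 8 = 9.7125

# The built-in mean() function gives the same answers
c(before_mean, mean(before))
c(after_mean, mean(after))
```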

Calculation of the median

  • Sort your data from low to high
  • Select the middle observation(s)
    • If n is odd
      • Choose the (n+1)/2 observation
    • If n is even
      • Go halfway between the (n/2) and (n/2 + 1) observations

Calculate the before median, 1

Here is the sorted data.

  room before
1  125    7.1
2  163    8.2
3  218   10.1
4  233   10.8
5  121   11.8
6  264   14.0
7  325   14.0
8  324   14.6

Calculate the before median, 2

Here are the middle two observations

  room before middle
1  125    7.1       
2  163    8.2       
3  218   10.1       
4  233   10.8   10.8
5  121   11.8   11.8
6  264   14.0       
7  325   14.0       
8  324   14.6       

Calculate the before median, 3

Average the two middle observations

  room before middle median
1  125    7.1              
2  163    8.2              
3  218   10.1              
4  233   10.8   10.8   11.3
5  121   11.8   11.8       
6  264   14.0              
7  325   14.0              
8  324   14.6              

Calculate the after median, 1

Here is the sorted data.

  room after
1  125   3.8
2  163   7.2
3  233   8.3
4  121  10.1
5  218  10.5
6  264  12.0
7  324  12.1
8  325  13.7

Calculate the after median, 2

Here are the middle two observations

  room after middle
1  125   3.8       
2  163   7.2       
3  233   8.3       
4  121  10.1   10.1
5  218  10.5   10.5
6  264  12.0       
7  324  12.1       
8  325  13.7       

Calculate the after median, 3

Average the two middle observations

  room after middle median
1  125   3.8              
2  163   7.2              
3  233   8.3              
4  121  10.1   10.1   10.3
5  218  10.5   10.5       
6  264  12.0              
7  324  12.1              
8  325  13.7              
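
The sort-and-average steps can be sketched as a small R function. The name `median_by_hand` is hypothetical, chosen just for this illustration. In practice you would use the built-in `median()` function, which gives the same result.

```r
before <- c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0)
after  <- c(10.1, 7.2, 3.8, 12.0, 8.3, 10.5, 12.1, 13.7)

# Hypothetical helper that follows the steps on the slides
median_by_hand <- function(x) {
  x <- sort(x)                      # sort from low to high
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                  # n odd: take the middle value
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2   # n even: halfway between the middle two
  }
}

median_by_hand(before)   # halfway between 10.8 and 11.8
median_by_hand(after)    # halfway between 10.1 and 10.5
```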

Criticisms of the mean and median

  • Are you combining apples and onions?
  • Are you ignoring minorities?

Excerpt from Gould 1985 publication

Choosing between the mean and median

  • Often, either is fine
  • When do you use the mean?
    • When totals are important
    • “In 2020, the average expenditure by the Italian National Health Service (Servizio Sanitario Nazionale, SSN) per patient affected by at least one chronic disease was approximately 696 euros.”
  • When do you use the median?
    • When outliers/skewness might distort your conclusions
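
To see how an outlier can distort the mean but not the median, here is a small sketch using the before-remediation colony counts from earlier. The outlier is hypothetical: I pretend that the value 14.6 was mistyped as 146.

```r
before <- c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0)

# Hypothetical data entry error: 14.6 typed as 146
with_outlier <- replace(before, before == 14.6, 146)

# The mean more than doubles
c(mean(before), mean(with_outlier))

# The median does not change at all
c(median(before), median(with_outlier))
```

The mean jumps from 11.325 to 27.75, while the median stays at 11.3.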

Chen et al 2019

Chen 2019, PMID: 31806195 (continued)

Background: The prices of newly approved cancer drugs have risen over the past decades. A key policy question is whether the clinical gains offered by these drugs in treating specific cancer indications justify the price increases.

Chen 2019, PMID: 31806195 (continued)

Results: We found that between 1995 and 2012, price increases outstripped median survival gains, a finding consistent with previous literature. Nevertheless, price per mean life-year gained increased at a considerably slower rate, suggesting that new drugs have been more effective in achieving longer-term survival. Between 2013 and 2017, price increases reflected equally large gains in median and mean survival, resulting in a flat profile for benefit-adjusted launch prices in recent years.

Break #3

  • What you have learned
    • Mean and median
  • What’s coming next
    • Percentiles

Computing percentiles

  • Many formulas
    • Differences are not worth fighting over
  • My preference (pth quantile)
    • Sort the data
    • Calculate p*(n+1)
    • Is it a whole number?
      • Yes: Select that value
      • No: Go halfway between the two neighboring values
      • Special cases: p(n+1) < 1 or > n

Some examples of percentile calculations

  • Example for n=39
    • For 5th percentile, p(n+1)=2 -> 2nd smallest value
    • For 4th percentile, p(n+1)=1.6 -> halfway between two smallest values
    • For 2nd percentile, p(n+1)=0.8 -> smallest value
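
Here is a minimal R sketch of the p(n+1) rule. The function name `pctile` is hypothetical. The slides only show the special case where p(n+1) is less than 1; returning the largest value when p(n+1) is greater than n is my assumption, by symmetry. The data are just the numbers 1 through 39, so the kth smallest value is k.

```r
# Hypothetical helper implementing the p*(n+1) rule
pctile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- p * (n + 1)
  if (h < 1) return(x[1])    # special case: below the smallest value
  if (h > n) return(x[n])    # special case (assumed): above the largest value
  # Whole number (allowing for rounding error): select that value
  if (abs(h - round(h)) < 1e-8) return(x[round(h)])
  # Otherwise go halfway between the two neighboring values
  lo <- floor(h)
  (x[lo] + x[lo + 1]) / 2
}

x <- 1:39
pctile(x, 0.05)   # p(n+1) = 2   -> 2nd smallest value
pctile(x, 0.04)   # p(n+1) = 1.6 -> halfway between the two smallest values
pctile(x, 0.02)   # p(n+1) = 0.8 -> smallest value
```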

Some terminology

  • Percentile: goes from 0% to 100%
  • Quantile: goes from 0.0 to 1.0
    • 90th percentile = 0.9 quantile
  • 25th, 50th, and 75th percentiles: quartiles
    • 25th percentile: \(Q_1,\ X_{0.25}\) or lower quartile
    • Median/50th percentile: \(Q_2\) or \(X_{0.5}\)
    • 75th percentile: \(Q_3,\ X_{0.75}\) or upper quartile

Calculate before remediation upper quartile, 1

Here is the sorted data.

  room before
1  125    7.1
2  163    8.2
3  218   10.1
4  233   10.8
5  121   11.8
6  264   14.0
7  325   14.0
8  324   14.6

Calculate before remediation upper quartile, 2

Calculate 0.75*(8+1) = 6.75. Select the 6th and 7th observations

  room before pick
1  125    7.1     
2  163    8.2     
3  218   10.1     
4  233   10.8     
5  121   11.8     
6  264   14.0   14
7  325   14.0   14
8  324   14.6     

Calculate before remediation upper quartile, 3

Average the two observations

  room before pick   q3
1  125    7.1          
2  163    8.2          
3  218   10.1          
4  233   10.8          
5  121   11.8          
6  264   14.0   14   14
7  325   14.0   14     
8  324   14.6          

Calculate after remediation upper quartile, 1

Here is the sorted data.

  room after
1  125   3.8
2  163   7.2
3  233   8.3
4  121  10.1
5  218  10.5
6  264  12.0
7  324  12.1
8  325  13.7

Calculate after remediation upper quartile, 2

Calculate 0.75*(8+1) = 6.75. Select the 6th and 7th observations

  room after pick
1  125   3.8     
2  163   7.2     
3  233   8.3     
4  121  10.1     
5  218  10.5     
6  264  12.0   12
7  324  12.1 12.1
8  325  13.7     

Calculate after remediation upper quartile, 3

Average the two observations

  room after pick    q3
1  125   3.8           
2  163   7.2           
3  233   8.3           
4  121  10.1           
5  218  10.5           
6  264  12.0   12 12.05
7  324  12.1 12.1      
8  325  13.7           
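
R's `quantile()` function uses a different interpolation rule by default (`type = 7`), so its answer can differ slightly from the halfway rule used above. This sketch compares the halfway rule with two of the rules built into `quantile()` for the after-remediation upper quartile. The object names are my own.

```r
after <- c(10.1, 7.2, 3.8, 12.0, 8.3, 10.5, 12.1, 13.7)

# Halfway rule from the slides: average the 6th and 7th sorted values
q3_halfway <- mean(sort(after)[6:7])

# R's default (type = 7) goes a quarter of the way from the 6th to the 7th value
q3_default <- unname(quantile(after, probs = 0.75))

# type = 6 also starts from p*(n+1) = 6.75, but goes three quarters of the way
q3_type6 <- unname(quantile(after, probs = 0.75, type = 6))

c(halfway = q3_halfway, default = q3_default, type6 = q3_type6)
```

The three answers are 12.05, 12.025, and 12.075. Differences this small are not worth fighting over.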

When you should use percentiles

  • Characterize variation
    • Middle 50% of the data
  • Exposure issues
    • Not enough to control median exposure level
  • Quantify extremes
    • What does “upper class” mean?
  • Quality control
    • Almost all products must meet a minimum standard

Break #4

  • What you have learned
    • Percentiles
  • What’s coming next
    • Standard deviation

Standard deviation

\[S = \sqrt{\frac{1}{n-1}\Sigma(X_i-\bar{X})^2}\]

At least one alternative formula.
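
Here is a step-by-step sketch of the formula in R, using the before-remediation colony counts. The slide does not say which alternative formula it has in mind; the shortcut version shown here, \(\sqrt{(\Sigma X_i^2 - n\bar{X}^2)/(n-1)}\), is one common choice and is my assumption.

```r
before <- c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0)
n <- length(before)

# Deviations from the mean
dev <- before - mean(before)

# Sum the squared deviations, divide by n - 1, take the square root
s <- sqrt(sum(dev^2) / (n - 1))

# Shortcut formula (assumed alternative): same answer, different arithmetic
s_alt <- sqrt((sum(before^2) - n * mean(before)^2) / (n - 1))

# Both agree with the built-in sd() function
round(c(s, s_alt, sd(before)), 1)   # 2.8 2.8 2.8
```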

Why is variation important

  • Variation = Noise
    • Too much noise can hide signals
  • Variation = Heterogeneity
    • Too little heterogeneity, hard to generalize
    • Too much heterogeneity, mixing apples and oranges
  • Variation = Unpredictability
    • Too much unpredictability, hard to prepare for the future
  • Variation = Risk
    • Too much risk can create a financial burden

Should you try to minimize variation?

  • Yes, for early studies
    • Easier to detect signals
    • Proof of concept trials
  • No, for later studies
    • Easier to generalize results
    • Pragmatic trials

Break #5

  • What you have learned
    • Standard deviation
  • What’s coming next
    • Computing means and standard deviations in R

Data dictionary for legionnaires, 1

---
data_dictionary: "legionnaire's disease"
format: 
  txt: tab-delimited
varnames: 
  first row of data
missing_value_code: 
  not needed
  

Data dictionary for legionnaires, 2

description: >
  Fictional data on bacteria counts before
  and after air conditioning maintenance.
additional_description:
  https://dasl.datadescription.com/datafile/legionnaires-disease
download_url:
  https://dasl.datadescription.com/download/data/3310
notes: >
  The use of a space in the first variable name might
  cause some minor difficulties during import.
  

Data dictionary for legionnaires, 3

source: >
  DASL (Data and Story Library), a repository for various
  data sets useful for teaching.
copyright: >
  Unknown. You should be able to use this data for
  individual educational purposes under the Fair Use
  guidelines of U.S. copyright law.
size:  
  rows: 8
  columns: 3

Data dictionary for legionnaires, 4

vars:
  Room number: 
    label: Hotel room number
  Before:
    label: Bacterial count before maintenance
    unit: colonies per cubic foot
    
  After:
    label: Bacterial count after maintenance
    unit: colonies per cubic foot
---

simon-5501-02-legionnaires.qmd, 1

---
title: "Univariate statistics for Legionnaires disease"
format: 
  html:
    slide-number: true
    embed-resources: true
editor: source
execute:
  echo: true
  message: false
  warning: false
---

simon-5501-02-legionnaires.qmd, 2

## Data source

This program uses data from a fictional study of Legionnaires disease and produces some simple univariate statistics: means, standard deviations, and percentiles. There is a [data dictionary][dd] that provides more details about the data. 

[dd]: https://github.com/pmean/data/blob/main/files/legionnaires-disease.yaml

simon-5501-02-legionnaires.qmd, 3

## Libraries

Here are the libraries you need for this program.

```{r setup}
library(tidyverse)
```

simon-5501-02-legionnaires.qmd, 4

## Reading the data

Here is the code to read the data and show a glimpse. There are only three columns, so the glimpse shows all of them.

```{r read}
fn <- "../data/legionnaires-disease.txt"
ld_raw_data <- read_tsv(fn, col_types="cnn")
glimpse(ld_raw_data)
```

simon-5501-02-legionnaires.qmd, 5

## Rename, 1

Notice how R encloses the first variable name (Room Number) in back-quotes. This is needed when a variable includes an embedded blank. You should rename this variable at your first opportunity.

```{r rename-1}
names(ld_raw_data)[1] <- "Room_Number"
glimpse(ld_raw_data)
```

simon-5501-02-legionnaires.qmd, 6

## Rename, 2

I find that many of the mistakes that I make are due to inconsistencies in how I name variables. Capitalization is one of the biggest problems. So I have gotten into the habit of converting variable names to all lower case. That way I don't have to worry about whether it is "Before" or "before". Here is the code to convert every capital letter to a lowercase letter.

```{r rename-2}
names(ld_raw_data) <- tolower(names(ld_raw_data))
glimpse(ld_raw_data)
```

simon-5501-02-legionnaires.qmd, 7

## Calculate means and standard deviations before remediation

```{r before-means}
ld_raw_data |>
  summarize(
    before_mn=mean(before),
    before_sd=sd(before)) 
```

The average colony count per cubic foot before remediation, 11.3, is quite large. The standard deviation, 2.8, represents a moderate amount of variation in this variable.

simon-5501-02-legionnaires.qmd, 8

## Calculate means and standard deviations after remediation

```{r after-means}
ld_raw_data |>
  summarize(
    after_mn=mean(after),
    after_sd=sd(after)) 
```

The average colony count per cubic foot after remediation, 9.7, is still quite large. The standard deviation, 3.2, represents a moderate amount of variation in this variable and is roughly comparable to the variation before remediation.

simon-5501-02-legionnaires.qmd, 9

## Calculate median and range before intervention

You could also use "median(before)", "min(before)", and "max(before)" in the code below.

```{r before-quantiles}
ld_raw_data |>
  summarize(
    before_median=quantile(before, probs=0.5),
    before_min=quantile(before, probs=0), 
    before_max=quantile(before, probs=1)) 
```

The median colony count before remediation, 11.3, is roughly the same as the mean. The data range from 7.1 to 14.6 colonies per cubic foot, a fairly wide range.

simon-5501-02-legionnaires.qmd, 10

## Calculate median and range after intervention

```{r after-quantiles}
ld_raw_data |>
  summarize(
    after_q50=quantile(after, probs=0.5),
    after_min=quantile(after, probs=0), 
    after_max=quantile(after, probs=1)) 
```

The median colony count, 10.3, is slightly lower after remediation. The data range from 3.8 to 13.7 colonies per cubic foot, a somewhat wider range than before remediation.

simon-5501-02-legionnaires.qmd, 11

## Additional comments

The names that you choose for the left-hand side of the equal sign are arbitrary. You should choose a descriptive name, but you have lots of options. A median of the before and after values could be called

-   Before_median, After_median
-   Median0, Median1
-   Second_quartile_A, Second_quartile_B
-   or many other reasonable choices.

simon-5501-02-legionnaires.qmd, 12

## Calculate a change score

For data like this with two measurements before and after an intervention, you should compute a change score. The way the computations are done below, a positive value means a reduction in colony counts. Note that any time you make a major change in a dataset, you should save it with a different name. That makes it easier for you to back up if you end up going down a blind alley.

```{r}
ld_raw_data |>
  mutate(change=before-after) -> ld_change_scores
glimpse(ld_change_scores)
```
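
A natural next step is to summarize the change scores. This is a base R sketch with the eight before and after values typed in directly so that it runs on its own; in the program above you would summarize `ld_change_scores` instead. The object names are my own.

```r
before <- c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0)
after  <- c(10.1, 7.2, 3.8, 12.0, 8.3, 10.5, 12.1, 13.7)

# Positive values mean a reduction in colony counts
change <- before - after

# Average reduction; this equals the before mean minus the after mean
change_mn <- mean(change)   # 11.325 - 9.7125 = 1.6125

# Number of rooms where the colony count went up instead of down
n_increase <- sum(change < 0)

c(change_mn = change_mn, n_increase = n_increase)
```

Only one of the eight rooms shows an increase in colony counts after remediation.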

Summary

  • What you have learned
    • Counts and percentages
    • Computing counts and percentages using R
    • Mean and median
    • Percentiles
    • Standard deviation
    • Computing means and standard deviations in R